本文介绍了一种无监督的基于分段的稳健语音活动检测方法(RVAD)。该方法包括两个去噪之后的传递,然后是语音活动检测(VAD)阶段。在第一通道中,通过使用后验信噪比(SNR)加权能量差来检测语音信号中的高能段,并且如果在段内没有检测到间距,则该段被认为是高能量噪声段并设置为零。在第二种通过中,语音信号由语音增强方法进行去噪,探索了几种方法。接下来,具有间距的相邻帧被分组在一起以形成音调段,并且基于语音统计,俯仰段进一步从两端延伸,以便包括浊音和发声声音和可能的非语音部分。最后,将后验SNR加权能量差应用于用于检测语音活动的去噪语音信号的扩展桨距片段。我们使用两个数据库,大鼠和极光-2评估所提出的方法的VAD性能,该方法包含大量噪声条件。在扬声器验证性能方面进一步评估RVAD方法,在Reddots 2016挑战数据库及其噪声损坏版本方面。实验结果表明,RVAD与许多现有方法有利地比较。此外,我们介绍了一种修改版的RVAD,其中通过计算有效的光谱平坦度计算替换计算密集的俯仰提取。修改的版本显着降低了适度较低的VAD性能成本的计算复杂性,这是在处理大量数据并在低资源设备上运行时的优势。 RVAD的源代码被公开可用。
translated by 谷歌翻译
精益燃烧是环境友好的,NOX排放量低,并且在燃烧系统中还提供了更好的燃油效率。但是,接近瘦燃烧会使引擎更容易容易倾斜。精益井喷(LBO)是一种不希望的现象,可能会导致突然的火焰灭绝,从而导致突然失去权力。在设计阶段,对于科学家来说,准确确定最佳的操作限制以避免突然发生LBO的情况非常具有挑战性。因此,至关重要的是,在低NOX排放发动机中开发准确且可计算的框架来在线LBO检测。据我们所知,我们第一次提出了一种深度学习方法来检测燃烧系统中的精益井喷。在这项工作中,我们利用实验室规模的燃烧器收集不同协议的数据。对于每个协议,我们远离LBO,并逐渐朝LBO制度移动,在每个条件下捕获一个准静态时间序列数据集。使用数据集中的一个协议作为参考协议,并在域专家注释的条件下,我们找到了经过培训的深度学习模型的过渡状态指标,以在其他测试协议中检测LBO。我们发现,我们所提出的方法比其他基线模型更准确和计算更快,以检测到LBO的过渡。因此,我们建议使用瘦燃烧引擎中实时性能监视的方法。
translated by 谷歌翻译
Machine Translation (MT) system generally aims at automatic representation of source language into target language retaining the originality of context using various Natural Language Processing (NLP) techniques. Among various NLP methods, Statistical Machine Translation(SMT). SMT uses probabilistic and statistical techniques to analyze information and conversion. This paper canvasses about the development of bilingual SMT models for translating English to fifteen low-resource Indian Languages (ILs) and vice versa. At the outset, all 15 languages are briefed with a short description related to our experimental need. Further, a detailed analysis of Samanantar and OPUS dataset for model building, along with standard benchmark dataset (Flores-200) for fine-tuning and testing, is done as a part of our experiment. Different preprocessing approaches are proposed in this paper to handle the noise of the dataset. To create the system, MOSES open-source SMT toolkit is explored. Distance reordering is utilized with the aim to understand the rules of grammar and context-dependent adjustments through a phrase reordering categorization framework. In our experiment, the quality of the translation is evaluated using standard metrics such as BLEU, METEOR, and RIBES
translated by 谷歌翻译
Modern telecom systems are monitored with performance and system logs from multiple application layers and components. Detecting anomalous events from these logs is key to identify security breaches, resource over-utilization, critical/fatal errors, etc. Current supervised log anomaly detection frameworks tend to perform poorly on new types or signatures of anomalies with few or unseen samples in the training data. In this work, we propose a meta-learning-based log anomaly detection framework (LogAnMeta) for detecting anomalies from sequence of log events with few samples. LoganMeta train a hybrid few-shot classifier in an episodic manner. The experimental results demonstrate the efficacy of our proposed method
translated by 谷歌翻译
Egocentric 3D human pose estimation with a single head-mounted fisheye camera has recently attracted attention due to its numerous applications in virtual and augmented reality. Existing methods still struggle in challenging poses where the human body is highly occluded or is closely interacting with the scene. To address this issue, we propose a scene-aware egocentric pose estimation method that guides the prediction of the egocentric pose with scene constraints. To this end, we propose an egocentric depth estimation network to predict the scene depth map from a wide-view egocentric fisheye camera while mitigating the occlusion of the human body with a depth-inpainting network. Next, we propose a scene-aware pose estimation network that projects the 2D image features and estimated depth map of the scene into a voxel space and regresses the 3D pose with a V2V network. The voxel-based feature representation provides the direct geometric connection between 2D image features and scene geometry, and further facilitates the V2V network to constrain the predicted pose based on the estimated scene geometry. To enable the training of the aforementioned networks, we also generated a synthetic dataset, called EgoGTA, and an in-the-wild dataset based on EgoPW, called EgoPW-Scene. The experimental results of our new evaluation sequences show that the predicted 3D egocentric poses are accurate and physically plausible in terms of human-scene interaction, demonstrating that our method outperforms the state-of-the-art methods both quantitatively and qualitatively.
translated by 谷歌翻译
The coverage of different stakeholders mentioned in the news articles significantly impacts the slant or polarity detection of the concerned news publishers. For instance, the pro-government media outlets would give more coverage to the government stakeholders to increase their accessibility to the news audiences. In contrast, the anti-government news agencies would focus more on the views of the opponent stakeholders to inform the readers about the shortcomings of government policies. In this paper, we address the problem of stakeholder extraction from news articles and thereby determine the inherent bias present in news reporting. Identifying potential stakeholders in multi-topic news scenarios is challenging because each news topic has different stakeholders. The research presented in this paper utilizes both contextual information and external knowledge to identify the topic-specific stakeholders from news articles. We also apply a sequential incremental clustering algorithm to group the entities with similar stakeholder types. We carried out all our experiments on news articles on four Indian government policies published by numerous national and international news agencies. We also further generalize our system, and the experimental results show that the proposed model can be extended to other news topics.
translated by 谷歌翻译
Many problems can be viewed as forms of geospatial search aided by aerial imagery, with examples ranging from detecting poaching activity to human trafficking. We model this class of problems in a visual active search (VAS) framework, which takes as input an image of a broad area, and aims to identify as many examples of a target object as possible. It does this through a limited sequence of queries, each of which verifies whether an example is present in a given region. We propose a reinforcement learning approach for VAS that leverages a collection of fully annotated search tasks as training data to learn a search policy, and combines features of the input image with a natural representation of active search state. Additionally, we propose domain adaptation techniques to improve the policy at decision time when training data is not fully reflective of the test-time distribution of VAS tasks. Through extensive experiments on several satellite imagery datasets, we show that the proposed approach significantly outperforms several strong baselines. Code and data will be made public.
translated by 谷歌翻译
We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled video clips. XKD is trained with two pseudo tasks. First, masked data reconstruction is performed to learn modality-specific representations. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through teacher-student setups to learn complementary information. To identify the most effective information to transfer and also to tackle the domain gap between audio and visual modalities which could hinder knowledge transfer, we introduce a domain alignment strategy for effective cross-modal distillation. Lastly, to develop a general-purpose solution capable of handling both audio and visual streams, a modality-agnostic variant of our proposed framework is introduced, which uses the same backbone for both audio and visual modalities. Our proposed cross-modal knowledge distillation improves linear evaluation top-1 accuracy of video action classification by 8.4% on UCF101, 8.1% on HMDB51, 13.8% on Kinetics-Sound, and 14.2% on Kinetics400. Additionally, our modality-agnostic variant shows promising results in developing a general-purpose network capable of handling different data streams. The code is released on the project website.
translated by 谷歌翻译
The vision community has explored numerous pose guided human editing methods due to their extensive practical applications. Most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. However, the problem is ill-defined in cases when the target pose is significantly different from the input pose. Existing methods then resort to in-painting or style transfer to handle occlusions and preserve content. In this paper, we explore the utilization of multiple views to minimize the issue of missing information and generate an accurate representation of the underlying human model. To fuse the knowledge from multiple viewpoints, we design a selector network that takes the pose keypoints and texture from images and generates an interpretable per-pixel selection map. After that, the encodings from a separate network (trained on a single image human reposing task) are merged in the latent space. This enables us to generate accurate, precise, and visually coherent images for different editing tasks. We show the application of our network on 2 newly proposed tasks - Multi-view human reposing, and Mix-and-match human image generation. Additionally, we study the limitations of single-view editing and scenarios in which multi-view provides a much better alternative.
translated by 谷歌翻译
The use of multilingual language models for tasks in low and high-resource languages has been a success story in deep learning. In recent times, Arabic has been receiving widespread attention on account of its dialectal variance. While prior research studies have tried to adapt these multilingual models for dialectal variants of Arabic, it still remains a challenging problem owing to the lack of sufficient monolingual dialectal data and parallel translation data of such dialectal variants. It remains an open problem on whether the limited dialectical data can be used to improve the models trained in Arabic on its dialectal variants. First, we show that multilingual-BERT (mBERT) incrementally pretrained on Arabic monolingual data takes less training time and yields comparable accuracy when compared to our custom monolingual Arabic model and beat existing models (by an avg metric of +$6.41$). We then explore two continual pre-training methods-- (1) using small amounts of dialectical data for continual finetuning and (2) parallel Arabic to English data and a Translation Language Modeling loss function. We show that both approaches help improve performance on dialectal classification tasks ($+4.64$ avg. gain) when used on monolingual models.
translated by 谷歌翻译